Part 1: PCA with penguins data

First, some wrangling! Select the columns we need, remove rows with NA values (incomplete observations can't be placed in multivariate space), and scale the data so that no variable is over-weighted in the principal components just because of the units it is measured in.

Notice that body mass is in grams, with values in the thousands, whereas bill length is in mm, with values in the tens.
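To see why the units matter, here is a minimal base R sketch. The toy values below are made up for illustration (they are not the penguins data); the point is only that an unscaled variable with a large variance dominates PC1.

```r
# Toy data: made-up values, NOT the penguins dataset
toy <- data.frame(
  mass_g  = c(3500, 4200, 5100, 3900, 4700),
  bill_mm = c(39, 45, 50, 41, 47)
)

# Variance of mass_g is orders of magnitude larger than that of bill_mm
sapply(toy, var)

# scale() centers each column to mean 0 and rescales it to sd 1
toy_scaled <- scale(toy)

unscaled_pca <- prcomp(toy)
scaled_pca   <- prcomp(toy_scaled)

# Without scaling, PC1 is almost entirely mass_g
abs(unscaled_pca$rotation[, "PC1"])

# After scaling, both variables load on PC1 with equal magnitude
abs(scaled_pca$rotation[, "PC1"])
```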

Notes:
* ends_with() is a selection helper that picks all variables whose names end with a given string
* drop_na() drops every row containing an NA; list variables inside () to specify which columns to check for NA
* scale() centers each column to mean 0 and scales it to standard deviation 1
* prcomp() runs the principal components analysis and returns a list (class prcomp) instead of a data frame
* autoplot() uses ggplot2 to draw a default plot for an object of a particular class in a single command. For example, for a PCA object it will assume you want a PCA biplot (the autoplot() method for prcomp objects comes from the ggfortify package)
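As a rough illustration of what the list returned by prcomp() contains, the sketch below uses the built-in iris measurements as a stand-in (an assumption made so the example runs without extra packages; it is not the penguins analysis).

```r
# Stand-in data: the four numeric columns of the built-in iris dataset
pca <- prcomp(scale(iris[, 1:4]))

class(pca)  # a "prcomp" object, which is a list underneath
names(pca)  # includes sdev, rotation (the loadings), and x (the scores)

# Proportion of total variance explained by each principal component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
round(var_explained, 3)

# summary() reports the same proportions plus the cumulative total
summary(pca)
```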

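# Packages used in Part 1 (skip if already attached earlier)
library(tidyverse)
library(palmerpenguins) # penguins data
library(ggfortify)      # autoplot() method for prcomp objects

penguins_pca <- penguins %>% 
  select(body_mass_g, ends_with("_mm")) %>% 
  drop_na() %>% 
  scale() %>% 
  prcomp()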

penguins_pca$rotation # shows the loadings of each of the 4 variables on each principal component
##                          PC1         PC2        PC3        PC4
## body_mass_g        0.5483502 0.084362920 -0.5966001 -0.5798821
## bill_length_mm     0.4552503 0.597031143  0.6443012 -0.1455231
## bill_depth_mm     -0.4003347 0.797766572 -0.4184272  0.1679860
## flipper_length_mm  0.5760133 0.002282201 -0.2320840  0.7837987
# Create a basic biplot with autoplot(); by default it shows no variable loadings and doesn't label penguin species
autoplot(penguins_pca)

# Create a data frame of the same observations, keeping species and other variables we can map aesthetics to
penguin_complete <- penguins %>% 
  drop_na(body_mass_g, ends_with("_mm"))
# Includes PCA data, and the observations used to make that PCA 
# Observations used to create PCA and data used for aesthetics MUST align!
autoplot(penguins_pca,
         data = penguin_complete, 
         colour = 'species',
         loadings = TRUE, 
         loadings.label = TRUE)+
  theme_minimal()
## Warning: `select_()` is deprecated as of dplyr 0.7.0.
## Please use `select()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
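The alignment requirement noted in the comments above can be checked directly. The sketch below uses the built-in airquality data as a stand-in (an assumption, chosen because it contains NA values and needs no extra packages) and base R's complete.cases() in place of drop_na().

```r
# Stand-in data: airquality has NAs in Ozone and Solar.R
vars <- c("Ozone", "Solar.R", "Wind", "Temp")

# Keep only rows that are complete for the PCA variables,
# but retain every column for use in plot aesthetics
aq_complete <- airquality[complete.cases(airquality[, vars]), ]

aq_pca <- prcomp(scale(aq_complete[, vars]))

# The PCA scores and the plotting data must have one row per observation
nrow(aq_pca$x) == nrow(aq_complete)

# The original data would NOT align, because it still contains the NA rows
nrow(aq_pca$x) == nrow(airquality)
```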

Part 2: ggplot2 customization and reading in different file types

Read in an .xlsx file and do some wrangling

Notes:
* clean_names() by default changes all column headings to lower snake case
* mutate() can also be used to transform existing columns, not just to create new ones
* across() is a helper function used inside mutate() to apply a function to several columns at once, for example the last 3 columns, or columns whose names end with a given string
* tolower() transforms text to lowercase
* str_sub() extracts or replaces substrings (parts of a string) from a character vector; a negative end counts from the end of the string
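A small sketch of these helpers working together, on made-up rows. The column names mimic the landings data, but the values and the three-character " **" suffix are assumptions for illustration only.

```r
library(dplyr)
library(stringr)

# Made-up rows, NOT real landings data
toy_fish <- tibble(
  nmfs_name       = c("TUNAS **", "SALMONS **"),
  confidentiality = c("Public", "Confidential"),
  pounds          = c(1200, 300)
)

toy_clean <- toy_fish %>%
  # Lowercase every character column; numeric columns are left alone
  mutate(across(where(is.character), tolower)) %>%
  # end = -4 keeps everything up to the 4th character from the end,
  # which drops the last three characters
  mutate(nmfs_name = str_sub(nmfs_name, end = -4)) %>%
  filter(confidentiality == "public")

toy_clean
```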

library(readxl)  # read_excel()
library(here)    # here()
library(janitor) # clean_names()

fish_noaa <- read_excel(here("data", "foss_landings.xlsx")) %>% 
  clean_names() %>% 
  mutate(across(where(is.character), tolower)) %>% # for every character column, apply tolower() to convert to lowercase
  mutate(nmfs_name = str_sub(nmfs_name, end = -4)) %>% # overwrites the existing nmfs_name column, dropping its last 3 characters
  filter(confidentiality == "public")

Make a customized graph:

Notes:
* ggplotly() creates an interactive version of a ggplot graph
* gghighlight() highlights certain series or values

library(plotly)      # ggplotly()
library(gghighlight) # gghighlight()

fish_plot <- ggplot(data = fish_noaa, aes(x = year, y = pounds))+ 
  geom_line(aes(color = nmfs_name), show.legend = FALSE)+ 
  theme_minimal() 

fish_plot
## Warning: Removed 6 row(s) containing missing values (geom_path).

ggplotly(fish_plot)
# Use gghighlight to highlight certain series 
ggplot(data = fish_noaa, aes(x = year, y = pounds, group = nmfs_name))+ 
  geom_line()+ 
  theme_minimal()+ 
  gghighlight(nmfs_name == "tunas")
## Warning: Tried to calculate with group_by(), but the calculation failed.
## Falling back to ungrouped filter operation...
## label_key: nmfs_name
## Warning: Removed 6 row(s) containing missing values (geom_path).

# Use gghighlight to highlight series that meet a condition on their values
ggplot(data = fish_noaa, aes(x = year, y = pounds, group = nmfs_name))+ 
  geom_line(aes(color = nmfs_name))+ 
  theme_minimal()+ 
  gghighlight(max(pounds) > 1e8)
## label_key: nmfs_name
## Warning: Removed 6 row(s) containing missing values (geom_path).

Read in data from a URL, use lubridate and mutate(), and make a graph with months in logical order

Notes:
* read_csv() can read directly from a URL, but weigh that convenience against the cost of not knowing exactly what the dataset looked like when you did the analysis. You may want to download a local copy
* mdy() from lubridate converts month-day-year text to a date
* month.abb is a built-in constant vector in base R (not a function): indexing it with a month number, as in month.abb[record_month], returns the abbreviated month name. month.name[] returns the full name. case_when() could do the same manually
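A short base R sketch of the month lookup and ordering idea. The month numbers are made up, and base factor() with explicit levels is used here as a stand-in for fct_reorder(), so the example needs no packages.

```r
# month.abb and month.name are constant vectors, indexed by month number
month.abb[c(1, 6, 12)]
month.name[3]

# Made-up month numbers, as month() would return them
record_month <- c(12, 1, 6, 1)

# Default factor levels are sorted alphabetically: Dec comes first
month_default <- factor(month.abb[record_month])
levels(month_default)

# Setting levels in calendar order keeps plots in logical month order
month_ordered <- factor(
  month.abb[record_month],
  levels = month.abb[sort(unique(record_month))]
)
levels(month_ordered)
```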

monroe_wt <- read_csv("https://data.bloomington.in.gov/dataset/2c81cfe3-62c2-46ed-8fcf-83c1880301d1/resource/13c8f7aa-af51-4008-80a9-56415c7c931e/download/mwtpdailyelectricitybclear.csv")
## 
## -- Column specification --------------------------------------------------------
## cols(
##   date = col_character(),
##   kWh1 = col_double(),
##   kW1 = col_double(),
##   kWh2 = col_double(),
##   kW2 = col_double(),
##   solar_kWh = col_double(),
##   total_kWh = col_double(),
##   MG = col_double()
## )
library(lubridate) # mdy(), month()

monroe_ts <- monroe_wt %>% 
  mutate(date = mdy(date)) %>% 
  mutate(record_month = month(date)) %>% # creates a new column with just the month number from the date column
  mutate(month_name = month.abb[record_month]) %>% # look up the abbreviated month name from the month number
  mutate(month_name = fct_reorder(month_name, record_month)) # order the factor levels by month number so ggplot doesn't sort them alphabetically

ggplot(data = monroe_ts, aes(x = month_name, y = total_kWh))+
  geom_jitter()